When dealing with an unsupervised learning problem, it's usually difficult to get a good measure of how well the model performed. For this project, we will use the red and white wine data from the UCI archive (a very commonly used data set in ML).
We will add a label to the combined data set, then bring this label back later to see how well we can cluster the wine into groups.
Download the two CSV data files from the UCI repository (or just use the already downloaded CSV files).
Use read.csv to open both data sets and set them as df1 and df2. Pay attention to what the separator (sep) is.
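As a hint on the separator, here is a minimal sketch of reading semicolon-separated data with read.csv. It uses a tiny inline sample (passed through the text argument) instead of the real files, so the values and the file name in the comment are assumptions, not the actual data.

```r
# Toy stand-in for the first lines of a semicolon-separated wine file.
# With a real file you would pass its path instead of `text`, e.g.
# df1 <- read.csv("winequality-red.csv", sep = ";")   # file name assumed
csv_text <- "fixed acidity;volatile acidity;quality
7.4;0.70;5
7.8;0.88;5"

df1 <- read.csv(text = csv_text, sep = ";")

# read.csv turns spaces in header names into dots by default
# (check.names = TRUE), which is why columns show up as volatile.acidity.
print(names(df1))
```

If the separator is wrong, read.csv typically returns a single wide column, which is easy to spot with head().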
Now add a label column to both df1 and df2 indicating a label 'red' or 'white'.
Check the head of df1 and df2.
Combine df1 and df2 into a single data frame called wine.
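One possible way to do the labeling and combining steps is sketched below. The two small data frames are made-up stand-ins for the real files, and the choice of df1 as red and df2 as white is an assumption.

```r
# Toy data frames standing in for the two wine files (values are made up).
df1 <- data.frame(residual.sugar = c(1.9, 2.6), alcohol = c(9.4, 9.8))
df2 <- data.frame(residual.sugar = c(20.7, 1.6), alcohol = c(8.8, 9.5))

# Assumption: df1 holds the red wines and df2 the white wines.
df1$label <- "red"
df2$label <- "white"

# rbind stacks the rows; both data frames must have the same column names.
wine <- rbind(df1, df2)
print(head(wine))
```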
Check the structure of wine with str(wine).
Let's explore the data a bit and practice our ggplot2 skills!
Create a histogram of residual.sugar from the wine data. Color by red and white wines.
Create a histogram of citric.acid from the wine data. Color by red and white wines.
Create a histogram of alcohol from the wine data. Color by red and white wines.
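The three histograms follow the same pattern, so a sketch of one is enough. This assumes ggplot2 is installed and uses randomly generated values in place of the real wine data; the bin width and fill colors are arbitrary choices.

```r
library(ggplot2)

# Synthetic stand-in for the combined wine data (random values, not real).
set.seed(101)
wine <- data.frame(
  residual.sugar = c(rexp(50, rate = 0.4), rexp(150, rate = 0.15)),
  label = rep(c("red", "white"), times = c(50, 150))
)

# fill = label colors the bars by wine type; binwidth is an arbitrary choice.
pl <- ggplot(wine, aes(x = residual.sugar, fill = label)) +
  geom_histogram(binwidth = 1, color = "black") +
  scale_fill_manual(values = c(red = "#ae4554", white = "#faf7ea")) +
  theme_bw()
print(pl)
```

Swapping residual.sugar for citric.acid or alcohol in aes() gives the other two plots, with binwidth adjusted to the scale of each variable.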
Create a scatterplot of residual.sugar versus citric.acid, color by red and white wine.
Create a scatterplot of volatile.acidity versus residual.sugar, color by red and white wine.
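A scatterplot colored by wine type could be sketched as follows. Again the data are random placeholders rather than the real measurements, and the alpha value and colors are arbitrary.

```r
library(ggplot2)

# Synthetic stand-in for the combined wine data (random values, not real).
set.seed(101)
wine <- data.frame(
  citric.acid = runif(200, min = 0, max = 1),
  residual.sugar = rexp(200, rate = 0.2),
  label = rep(c("red", "white"), times = c(50, 150))
)

# "residual.sugar versus citric.acid": citric.acid on x, residual.sugar on y.
# alpha < 1 makes overlapping points easier to see.
pl <- ggplot(wine, aes(x = citric.acid, y = residual.sugar, color = label)) +
  geom_point(alpha = 0.4) +
  scale_color_manual(values = c(red = "#ae4554", white = "grey60")) +
  theme_bw()
print(pl)
```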
Feel free to explore the data as you see fit. We'll go ahead and move on!
Grab the wine data without the label column and call it clus.data.
Check the head of clus.data.
Call the kmeans function on clus.data and assign the results to wine.cluster.
Print out the wine.cluster Cluster Means and explore the information.
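The clustering steps above can be sketched like this. The data frame is synthetic (two loosely separated groups invented for illustration), and centers = 2 is an assumption based on expecting one cluster per wine type.

```r
set.seed(101)  # kmeans starts from random centers, so fix the seed

# Synthetic stand-in for wine: two loosely separated groups (not real data).
wine <- data.frame(
  total.sulfur.dioxide = c(rnorm(50, 46, 15), rnorm(150, 138, 30)),
  alcohol = rnorm(200, 10.5, 1),
  label = rep(c("red", "white"), times = c(50, 150))
)

# Drop the label column so the algorithm never sees the answer.
clus.data <- wine[, names(wine) != "label"]
print(head(clus.data))

# centers = 2 because we expect two groups (red and white).
wine.cluster <- kmeans(clus.data, centers = 2)

# One row per cluster, one column per feature.
print(wine.cluster$centers)
```

The returned object also carries the per-row assignments in wine.cluster$cluster. Note that the cluster numbers 1 and 2 are arbitrary and carry no meaning of their own.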
You usually won't have the luxury of labeled data with K-means, but let's go ahead and see how we did!
Use the table() function to compare your cluster results to the real results. Which is easier to correctly group, red or white wines?
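A minimal sketch of the comparison, using the same kind of synthetic data as a placeholder for the real wine set, so the counts it prints say nothing about the actual answer to the question.

```r
set.seed(101)  # fix the random starting centers

# Synthetic stand-in for wine (random values, not real data).
wine <- data.frame(
  total.sulfur.dioxide = c(rnorm(50, 46, 15), rnorm(150, 138, 30)),
  alcohol = rnorm(200, 10.5, 1),
  label = rep(c("red", "white"), times = c(50, 150))
)
wine.cluster <- kmeans(wine[, c("total.sulfur.dioxide", "alcohol")], centers = 2)

# Rows are the true labels, columns are the cluster ids assigned by kmeans.
confusion <- table(wine$label, wine.cluster$cluster)
print(confusion)
```

Because cluster ids are arbitrary, read the table by looking at which column holds most of each row: a row spread across both columns marks the wine type that is harder to group.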